Abstract

Background: The maximum clique enumeration (MCE) problem asks that we identify all maximum cliques in a 
finite, simple graph. MCE is closely related to two other well-known and widely-studied problems: the maximum 
clique optimization problem, which asks us to determine the size of a largest clique, and the maximal clique 
enumeration problem, which asks that we compile a listing of all maximal cliques. Naturally, these three problems 
are NP-hard, given that they subsume the classic version of the NP-complete clique decision problem. MCE 
can be solved in principle with standard enumeration methods due to Bron, Kerbosch, Kose and others. 
Unfortunately, these techniques are ill-suited to graphs encountered in our applications. We must solve MCE on 
instances deeply seeded in data mining and computational biology, where high-throughput data capture often 
creates graphs of extreme size and density. MCE can also be solved in principle using more modern algorithms 
based in part on vertex cover and the theory of fixed-parameter tractability (FPT). While FPT is an improvement, 
these algorithms too can fail to scale sufficiently well as the sizes and densities of our datasets grow. 

Results: An extensive testbed of benchmark graphs is created using publicly available transcriptomic datasets 
from the Gene Expression Omnibus (GEO). Empirical testing reveals crucial but latent features of such high- 
throughput biological data. In turn, it is shown that these features distinguish real data from random data intended 
to reproduce salient topological features. In particular, with real data there tends to be an unusually high degree of 
maximum clique overlap. Armed with this knowledge, novel decomposition strategies are tuned to the data and 
coupled with the best FPT MCE implementations. 

Conclusions: Several algorithmic improvements to MCE are made which progressively decrease the run time on 
graphs in the testbed. Frequently the final runtime improvement is several orders of magnitude. As a result, 
instances which were once prohibitively time-consuming to solve are brought into the domain of realistic 
feasibility. 



Background 

A clique is a fully-connected subgraph in a finite, simple 
graph. The problem of determining whether or not a 
graph has a clique of a given size, called simply CLIQUE, 
is one of the best known and most widely studied 
combinatorial problems. Although classically formulated 
as an NP-complete decision problem [1], where one is 
merely asked to determine the existence of a certain size 
clique, the search and optimization formulations are 
probably most often encountered in practice, where one 
is asked to find a clique of given size and largest size 
respectively. In computational biology, one needs to 
look no farther than PubMed to gauge clique's utility in 
a variety of applications. A notable example is the 
search for putative molecular response networks in 
high-throughput biological data. Popular clique-centric 
tools include clique community algorithms for clustering 
[2] and paraclique-based methods for QTL analysis and 
noise abatement [3,4]. 
A clique is maximal if it cannot be augmented by adding 
additional vertices. A clique is maximum if it is of 
largest size. A maximum clique is particularly useful in 
our work on graphs derived from biological datasets. It 
provides a dense core that can be extended to produce 
plausible biological networks [5]. Other biological applica- 
tions include the thresholding of normalized microarray 
data [6,7], searching for common cis-regulatory elements 
[8], and solving the compatibility problem in phylogeny 
[9]. See [10] for a survey of additional applications of max- 
imum clique. 

Any algorithm that relies on maximum clique, however, 
has the potential for inconsistency. This is because graphs 
often have more than just one maximum clique. Idiosyn- 
crasies between algorithms, or even among different 
implementations of the same algorithm, are apt to lead to 
an arbitrary choice of cliques. This motivates us to find an 
efficient mechanism to enumerate all maximum cliques in 
a graph. These can then be examined using a variety of 
relevant criteria, for example, by the average weight of 
correlations driven by strain or stimulus [11]. 

We therefore seek to solve the Maximum Clique 
Enumeration (MCE) problem. Unlike maximal clique enu- 
meration, for which a substantial body of literature exists, 
very little seems to be known about MCE. The only exception 
we have found is a game-theoretic approach for locating 
a predetermined number of largest cliques [12]. 

While very little prior work seems to have been done 
on MCE, the problem of maximal clique enumeration 
has been studied extensively. Since any algorithm that 
enumerates all maximal cliques also enumerates all maxi- 
mum cliques, it is reasonable to approach MCE by 
attempting first to adapt existing maximal clique enu- 
meration algorithms. An implementation of an existing 
maximal clique enumeration algorithm also provides a 
useful runtime benchmark that should be improved upon 
by any new approach. Besides maximal clique enumeration 
algorithms, another potential strategy is to compute 
the maximum clique size and then test all possible combinations 
of vertices of that size for connectivity. While 
this approach may be reasonable for very small clique 
sizes, the runtime quickly becomes prohibitive as the 
maximum clique size increases. We mention it only for 
completeness, and focus our efforts on modifying and 
extending existing algorithms for enumerating maximal 
cliques. 

Current maximal clique enumeration algorithms can be 
classified into two general types: iterative enumeration 
(breadth-first traversal of a search tree) and backtracking 
(depth-first traversal of a search tree). Iterative enumera- 
tion algorithms, such as the method suggested by Kose 
et al [13], enumerate all cliques of size k at each stage, 
test each one for maximality, then use the remaining cli- 
ques of size k to build cliques of size k + 1. The process 



is typically initialized for k = 3 by enumerating all vertex 
subsets of size 3 and testing for connectivity. In practice, 
such an approach can have staggering memory require- 
ments, because all cliques of a given size must be 
retained at each step. In [14], this approach is improved 
by using efficient bitwise operations to prune the number 
of cliques that must be saved. Nevertheless, storage needs 
can be excessive, since all maximal cliques of one size 
must still be made available before moving on to the next 
larger size. Figure 1 shows the number of maximal cli- 
ques of each size in one of the graphs near the median 
size in our testbed. This graphic illustrates the enormous 
lower bounds on memory that can be encountered with 
iterative enumeration algorithms. 
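
The level-by-level scheme can be sketched as follows. This is a minimal Python illustration, not the method of [13] or [14]. It assumes the graph is given as a dictionary `adj` mapping each vertex to the set of its neighbors (a representation chosen here only for illustration), and it starts from single vertices rather than from k = 3. The function name `iterative_enumeration` and the `peak` counter are hypothetical; `peak` records the largest number of cliques held at one time, which is where the memory cost arises.

```python
def iterative_enumeration(adj):
    """Breadth-first clique enumeration: build all (k+1)-cliques from all k-cliques.

    adj: dict mapping each vertex to the set of its neighbors (no self-loops).
    Returns (maximal_cliques, peak), where peak is the largest level size seen.
    """
    level = {frozenset([v]) for v in adj}   # all cliques of size 1
    maximal, peak = [], 0
    while level:
        peak = max(peak, len(level))        # every clique of this size is in memory
        next_level = set()
        for clique in level:
            # vertices adjacent to every member of the clique
            common = set.intersection(*(adj[v] for v in clique))
            if not common:
                maximal.append(sorted(clique))   # cannot be extended: maximal
            next_level.update(clique | {u} for u in common)
        level = next_level
    return maximal, peak
```

Even in this toy form, the whole of `level` must exist before `next_level` can be completed, which mirrors the storage problem described above.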

Many variations of backtracking algorithms for maxi- 
mal clique enumeration have been published in the lit- 
erature. To the best of our knowledge, all can be traced 
back to the algorithms of Bron and Kerbosch first pre- 
sented in [15]. Some subsequent modifications tweak the 
data structures used. Others change the order in which 
vertices are traversed. See [16] for a performance com- 
parison between several variations of backtracking algo- 
rithms. As a basis for improvement, however, we 
implemented the original, highly efficient algorithm of 
[15]. We made this choice for three reasons. First, an 
enormous proportion of the time consumed by enumera- 
tion algorithms is spent in outputting the maximal cli- 
ques that are generated. This output time is a practical 
limitation on any such approach. Second, a graph can 
theoretically contain as many as $3^{n/3}$ maximal cliques 
[17]. It was shown in [18] that the algorithm in [15] 
achieves this bound in the worst case. No algorithm with 
a theoretically lower asymptotic runtime can thus exist. 
Third, and most importantly, the improvements we 
introduce do not depend on the particulars of any one 
backtracking algorithm; they can be used in conjunction 
with any and all of them. 

Results and discussion 

Using the seminal maximal clique enumeration algorithm 
due to Bron and Kerbosch [15] as a benchmark, we 
designed, implemented, and extensively tested three algo- 
rithmic improvements, the last based on observations 
about the nature of graphs produced by transcriptomic 
data. Along with describing these improvements, we will 
describe our existing tool for finding a single maximum 
clique, based on the theory of fixed-parameter tractability 
(FPT) [19,20]. Such a tool is essential for all three 
improvements, since the first two rely on knowledge of 
the maximum clique size, and the last uses the maximum 
clique finding tool as a subroutine. All codes are written in 
C/C++ and compiled in Linux. For testing, we use 100 
graphs derived from 25 different datasets which are publicly 
available on GEO. We concentrate on transcriptomic 
data, given its abundance, and eschew synthetic data, having 
learned long ago that effective algorithms for one have 
little bearing on the other. (The pathological matchings 
noted in [21] for vertex cover can be extended to clique, 
but they are likewise irrelevant to real 
data.) In an effort to improve performance, we scrutinize 
the structure of transcriptomic graphs and explore the 
notion of maximum clique covers and essential vertex 
sets. Indeed, we find that with the right preprocessing we 
are able to tailor algorithms to the sorts of data we routinely 
encounter, and that we can now solve instances previously 
considered unassailable. 

Figure 1 Maximal Clique Profile. The maximal clique profile of a graph created from the GDS3672 dataset using a threshold value of 0.81, the 
dataset's second highest threshold. MCE algorithms that are based on a breadth-first traversal of the search tree will retain at each step all 
maximal cliques of a given size. This can lead to titanic memory requirements. This graph, for example, contains more than 110 million maximal 
cliques of size 70. This sort of memory demand tends to render non-backtracking methods impractical. 

Algorithms 

In the following sections, we describe each of the MCE 
algorithms we implemented and tested. The first is the 
algorithm of Bron and Kerbosch [15], which we call 
Basic Backtracking and use as a benchmark. Since all 
our subsequent improvements make use of an algorithm 
that finds a single maximum clique, we next describe 
our existing tool, called Maximum Clique Finder (MCF), 
which does just that. We next modify the Basic Back- 
tracking algorithm to take advantage of the fact that we 
only want to find the maximum cliques and can quickly 
compute the maximum clique size. We call this 
approach Intelligent Backtracking, since it actively 
returns early from branches that will not lead to a maxi- 
mum clique. We then modify MCF itself to enumerate 
all maximum cliques, an approach we call Parameter- 
ized Maximum Clique, or Parameterized MC. In a sense 
this is another backtracking approach that goes even 
further to exploit the fact that we only want to find 
maximum cliques. Finally, based on observations about 
the properties of biological graphs, we introduce the 
concepts maximum clique covers and essential vertex 
sets, and apply them to significantly improve the run- 
time of backtracking algorithms. 



Basic backtracking 

The seminal maximal clique publication [15] describes 
two algorithms. A detailed presentation of the second, 
which is an improved version of the first, is provided. It 
is this second, more efficient, method that we imple- 
ment and test. We shall refer to it here as Basic Back- 
tracking. All maximal cliques are enumerated with a 
depth-first search tree traversal. The primary data struc- 
tures employed are three global sets of vertices: COMP- 
SUB, CANDIDATES and NOT. COMPSUB contains 
the vertices in the current clique, and is initially empty. 
CANDIDATES contains unexplored vertices that can 
extend the current clique, and initially contains all ver- 
tices in the graph. NOT contains explored vertices that 
cannot extend the current clique, and is initially empty. 
Each recursive call performs three steps: 

• Select a vertex v in CANDIDATES and move it to 
COMPSUB. 

• Remove all vertices not adjacent to v from both 
CANDIDATES and NOT. At this point, if both 
CANDIDATES and NOT are empty, then COMPSUB 
is a maximal clique. If so, output COMPSUB as 
a maximal clique and continue to the next step. If not, 
then recursively call the previous step. 

• Move v from COMPSUB to NOT. 

Note that NOT is used to keep from generating dupli- 
cate maximal cliques. The search tree can be pruned by 
terminating a branch early if some vertex of NOT is 
connected to all vertices of CANDIDATES. 

Vertices are selected in a way that causes this pruning to 
occur as soon as possible. We omit the details since they 
are not pertinent to our modifications of the algorithm. 
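
The recursion described above can be sketched in Python as follows. This is an illustrative simplification, not the implementation of [15] or ours: it assumes the graph is a dictionary `adj` mapping each vertex to the set of its neighbors, passes the three sets as arguments rather than keeping them global, and takes candidates in arbitrary order instead of applying the vertex-selection rule. The function name `bron_kerbosch` is chosen here for illustration.

```python
def bron_kerbosch(adj, compsub=None, candidates=None, not_set=None, out=None):
    """Enumerate all maximal cliques of adj (dict: vertex -> set of neighbors)."""
    if out is None:
        compsub, candidates, not_set, out = [], set(adj), set(), []
    if not candidates and not not_set:
        out.append(sorted(compsub))       # nothing can extend COMPSUB: maximal
        return out
    # Prune: a vertex of NOT adjacent to every candidate means no new
    # maximal clique can be found below this node.
    if any(candidates <= adj[u] for u in not_set):
        return out
    for v in list(candidates):
        compsub.append(v)                 # move v into COMPSUB
        bron_kerbosch(adj, compsub,
                      candidates & adj[v],    # keep only neighbors of v
                      not_set & adj[v], out)
        compsub.pop()                     # move v from COMPSUB ...
        candidates.remove(v)
        not_set.add(v)                    # ... to NOT
    return out
```

For a triangle 0-1-2 with a pendant vertex 3 attached to 2, the call `bron_kerbosch(adj)` yields the two maximal cliques {0, 1, 2} and {2, 3}.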

The storage requirements of Basic Backtracking are 
relatively modest. No information about previous maxi- 
mal cliques needs to be retained. In the improvements 
we will test, we focus on speed but also improve memory 
usage. Thus, such limitations are in no case prohibitive 
for any of our tested methods. Nevertheless, in 
some environments, memory utilization can be extreme. 
We refer the interested reader to [14]. 

Our Basic Backtracking implementation serves as an 
initial benchmark upon which we can now try to 
improve. 

Finding a single maximum clique 

We use the term Maximum Clique Finder (MCF) to 
denote the software we have implemented and refined 
for finding a single clique of largest size [22]. MCF 
employs a suite of preprocessing rules along with a 
branching strategy that mirrors the well-known FPT 
approach to vertex cover [19,23]. It first invokes a simple 
greedy heuristic to find a reasonably large clique rapidly. 
This clique is then used for preprocessing, since it puts a 
lower bound on the maximum clique size. The heuristic 
works by choosing the highest degree vertex, v, then 
choosing the highest degree neighbor of v. These two 
vertices form an initial clique C, which is then iteratively 
extended by choosing the highest degree vertex adjacent 
to all of C. On each iteration, any vertex not adjacent to 
all of C is removed. The process continues until no more 
vertices exist outside C. Since |C| is a lower bound on 
the maximum clique size, all vertices with degree less 
than |C| - 1 can be permanently removed from the original 
graph. Next, all vertices with degree n - 1 are temporarily 
removed from the graph, but retained in a list 
since they must be part of any maximum clique. MCF 
exploits a novel form of color preprocessing [22], used 
previously in [24] to guide branching. This form of pre- 
processing attempts to reduce the graph as follows. 
Given a known lower bound k on the size of the maximum 
clique, for each vertex v we apply fast greedy coloring 
to v and its neighbors. If these vertices can be colored 
with fewer than k colors, then v cannot be part of a 



maximum clique and is removed from the graph. Once 
the graph is thus reduced, MCF uses standard recursive 
branching on vertices, where each branch assumes that 
the vertex either is or is not in the maximum clique. 
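
Two of the steps above, the greedy heuristic and color preprocessing, can be sketched as follows. This is a simplified illustration under stated assumptions, not the MCF code of [22]: the graph is a dictionary `adj` mapping each vertex to the set of its neighbors, the function names are hypothetical, and the coloring pass is applied once to the original graph rather than repeatedly to the shrinking graph.

```python
def greedy_clique(adj):
    """Grow a clique by repeatedly adding the highest-degree common neighbor."""
    v = max(adj, key=lambda u: len(adj[u]))      # highest-degree vertex
    clique, cand = [v], set(adj[v])              # cand: adjacent to all of clique
    while cand:
        u = max(cand, key=lambda w: len(adj[w]))
        clique.append(u)
        cand &= adj[u]                           # drop vertices not adjacent to u
    return clique

def greedy_colors(adj, vertices):
    """Number of colors used by first-fit coloring of the given vertices."""
    color = {}
    for u in vertices:
        used = {color[w] for w in adj[u] if w in color}
        c = 0
        while c in used:
            c += 1
        color[u] = c
    return len(set(color.values()))

def color_preprocess(adj, k):
    """Vertices that may still belong to a clique of size at least k."""
    keep = set()
    for v in adj:
        # degree rule, then coloring rule on v together with its neighbors
        if len(adj[v]) >= k - 1 and greedy_colors(adj, [v, *adj[v]]) >= k:
            keep.add(v)
    return keep
```

The coloring rule is safe because a clique of size k needs k distinct colors in any proper coloring, so a neighborhood colored with fewer than k colors cannot contain one.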

Intelligent backtracking 

Given the relative effectiveness with which we can find a 
single maximum clique, it seems logical to consider 
whether knowledge of that clique's size can be helpful 
in enumerating all maximum cliques. As it turns out, 
knowledge of the maximum clique size k leads to a 
small, straightforward change in the Basic Backtracking 
algorithm. Specifically, at each node in the search tree 
we check if there are fewer than k vertices in the union 
of COMPSUB and CANDIDATES. If so, that branch 
cannot lead to a clique of size k, and so we return. See 
Figure 2. While the modification may seem minor, the 
resultant pruning of the search tree can lead to a sub- 
stantial reduction in the search space. In addition to this 
minor change to branching, we apply color preproces- 
sing as previously described to reduce the graph before 
submitting it to the improved backtracking algorithm. 
Color preprocessing combined with the minor branch- 
ing change we call Intelligent Backtracking. 
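
The branching change can be sketched as follows. This is a minimal Python illustration rather than our C/C++ code: it assumes a dictionary `adj` mapping each vertex to the set of its neighbors and a maximum clique size `k` computed beforehand, and it omits color preprocessing. The name `maximum_cliques_bounded` is hypothetical.

```python
def maximum_cliques_bounded(adj, k):
    """Backtracking that abandons any branch unable to reach a clique of size k.

    adj: dict vertex -> set of neighbors; k: the known maximum clique size.
    """
    out = []

    def expand(compsub, candidates, not_set):
        if len(compsub) + len(candidates) < k:
            return                              # too few vertices left: prune
        if not candidates:
            if not not_set:
                out.append(sorted(compsub))     # a maximal clique of size k
            return
        for v in list(candidates):
            if len(compsub) + len(candidates) < k:
                break                           # remaining siblings cannot succeed
            expand(compsub + [v], candidates & adj[v], not_set & adj[v])
            candidates.remove(v)
            not_set.add(v)

    expand([], set(adj), set())
    return out
```

Because `k` is the maximum clique size, every clique that survives the size test is a maximum clique.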

Parameterized enumeration 

Given that MCF employs a vertex branching strategy, 
we investigated whether it could be modified to enu- 
merate not just one, but all maximum cliques. It turns 
out that MCF, also, lends itself to a straightforward 
modification that results in enumeration of all maxi- 
mum cliques. The modification is simply to maintain a 
global list of all cliques of the largest size found thus 
far. Whenever a larger maximum clique is found, the 
list is flushed and refreshed to contain only the new 
maximum clique. When the search space has been 
exhausted, the list of maximum cliques is output. 
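
The list-maintenance idea can be sketched with a bare vertex-branching search. This is an illustrative Python sketch, not MCF itself: all preprocessing is omitted, the graph is assumed to be a dictionary `adj` mapping each vertex to the set of its neighbors, and the name `enumerate_maximum_cliques` is hypothetical.

```python
def enumerate_maximum_cliques(adj):
    """Branch on each vertex being in or out of the clique, keeping every
    clique of the largest size found so far."""
    best = []                                   # cliques of the largest size so far

    def branch(clique, cand):
        # cand holds the vertices adjacent to everything in clique
        if best and len(clique) + len(cand) < len(best[0]):
            return                              # cannot match the current best size
        if not cand:
            if best and len(clique) > len(best[0]):
                best.clear()                    # larger clique found: flush the list
            if not best or len(clique) == len(best[0]):
                best.append(sorted(clique))
            return
        v = next(iter(cand))
        branch(clique + [v], cand & adj[v])     # branch 1: v is in the clique
        branch(clique, cand - {v})              # branch 2: v is not in the clique

    branch([], set(adj))
    return best
```

Note that the bound uses a strict inequality, so branches that can only tie the current best size are still explored. That is what turns a search for one maximum clique into an enumeration of all of them.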

Figure 2 Intelligent Backtracking. A minor change to the Bron-Kerbosch algorithm uses the precomputed maximum clique size to trim the 
recursion tree. The input graph has typically been reduced using color preprocessing. 

Eblen et al. BMC Bioinformatics 2012, 13(Suppl 10):S5 
http://www.biomedcentral.com/1471-2105/13/S10/S5 

We must take special care, however, to note that certain 
preprocessing rules used during interleaving are no 
longer valid. Consider, for example, the removal of a leaf 
vertex. The clique analogue is to find a vertex with 
degree n - 2 and remove its lone non-neighbor. This 
rule patently assumes that only a single maximum clique 
is desired, because it ignores any clique depending on 
the discarded vertex. Therefore this particular prepro- 
cessing rule must be omitted once branching has begun. 

Maximum clique covers 

If we view MCF as a black box subroutine that can be 
called repeatedly, it can be used in a simple greedy algo- 
rithm for computing a maximal set of disjoint maximum 
cliques. We merely compute a maximum clique, remove it 
from the graph, and iterate until the size of a maximum 
clique decreases. To explore the advantages of computing 
such a set, we introduce the following notion: 

Definition 1 A maximum clique cover of G = (V, E) is 
a set $V' \subseteq V$ with the property that each maximum clique 
of G contains some vertex in the cover. 

The union of all vertices contained in a maximal set of 
disjoint maximum cliques is of course a maximum cli- 
que cover (henceforth MCC), because all maximum cli- 
ques must overlap with such a set. This leads to a useful 
reduction algorithm. Any vertex not adjacent to at least 
one member of an MCC cannot be in a maximum cli- 
que, and can thus be removed. 
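
The greedy construction and the resulting reduction can be sketched as follows. This is a minimal Python illustration under stated assumptions: the graph is a dictionary `adj` mapping each vertex to the set of its neighbors, and the small exhaustive routine `max_clique` is only a stand-in for the MCF subroutine. All function names are hypothetical.

```python
def max_clique(adj, verts):
    """One maximum clique of the subgraph induced by verts (stand-in for MCF)."""
    best = []

    def grow(clique, cand):
        nonlocal best
        if len(clique) > len(best):
            best = clique
        if len(clique) + len(cand) <= len(best):
            return                              # cannot beat the best found
        for v in list(cand):
            cand = cand - {v}
            grow(clique + [v], cand & adj[v])

    grow([], set(verts))
    return best

def maximum_clique_cover(adj):
    """Union of a maximal set of disjoint maximum cliques, plus the clique size."""
    remaining = set(adj)
    first = max_clique(adj, remaining)
    cover, clique = set(), first
    while clique and len(clique) == len(first):   # stop once the size decreases
        cover |= set(clique)
        remaining -= set(clique)
        clique = max_clique(adj, remaining)
    return cover, len(first)

def reduce_with_cover(adj, cover):
    """Drop every vertex that is neither in the cover nor adjacent to it."""
    keep = {v for v in adj if v in cover or adj[v] & cover}
    return {v: adj[v] & keep for v in keep}
```

On a graph made of two disjoint triangles, a vertex attached to one triangle, and a separate edge, the cover is the six triangle vertices and the reduction removes only the separate edge.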

In practice, we find that applying MCC before the ear- 
lier backtracking algorithms yields only marginal 
improvement. The concept of MCC does, however, lead 
to a much more powerful approach based on individual 
vertices. Since any improvement made by MCC is sub- 
sumed by the next approach, we do not test MCC by 
itself. 

Essential vertex sets 

Our investigation of the MCC algorithm revealed that it 
typically does not reduce the size of the graph more than 
the preprocessing rules already incorporated into MCF. 
For example, MCF already quickly finds a lower bound on 
the maximum clique size and removes any vertex with 
degree lower than this bound. Upon closer examination, 
however, we found that for 74 of 75 graphs that we initi- 
ally tested for the conference version of this paper, only 
one clique was needed in an MCC. That is to say, one 
maximum clique covered all other maximum cliques. And 
in our current testbed of 100 graphs, in every case a single 
maximum clique suffices for an MCC. In fact this coin- 
cides closely with our experience, in which we typically see 
high overlap among large cliques in the transcriptomic 
graphs we encounter on a regular basis. Based on this 
observation, we shall now refine the concept of MCC. 
Rather than covering maximum cliques with cliques, we 
cover maximum cliques with individual vertices. 

We define an essential vertex as one that is contained in 
every maximum clique. Of course it is possible for a given 
graph to have no such vertex, even when it contains many 



overlapping maximum cliques. But empirical testing of 
large transcriptomic graphs shows that an overwhelming 
number contain numerous essential vertices. And for pur- 
poses of reducing the graph, even one will suffice. An 
essential vertex has the potential to be extremely helpful, 
because it allows us to remove all its non-neighbors. We 
employ the following observation: for any graph G, $\omega(G) > \omega(G/v)$ 
if and only if v covers all maximum cliques, 
where $\omega(G)$ is the maximum clique size of G. 

We define an essential set to be the set of all essential 
vertices. The Essential Set (ES) algorithm, as described 
in Figure 3, finds all essential vertices in a graph. It then 
reduces the graph by removing, for each essential vertex, 
all non-neighbors of that vertex. The ES algorithm can 
be run in conjunction with any of the backtracking 
MCE algorithms, or indeed prior to any algorithm that 
does MCE by any method, since its output is a reduced 
graph that still contains all maximum cliques from the 
original graph. As our tests show, the runtime improve- 
ment offered by the ES algorithm can be dramatic. 
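
The idea behind ES can be sketched directly from the observation above: a vertex is essential exactly when removing it lowers the maximum clique size. The Python below is a naive illustration of that test and is not a transcription of Figure 3. It assumes a dictionary `adj` mapping each vertex to the set of its neighbors, and the exhaustive routine `omega` is only a stand-in for a call to MCF. Both function names are hypothetical.

```python
def omega(adj, verts):
    """Maximum clique size of the subgraph induced by verts (stand-in for MCF)."""
    best = 0

    def grow(size, cand):
        nonlocal best
        best = max(best, size)
        if size + len(cand) <= best:
            return
        for v in list(cand):
            cand = cand - {v}
            grow(size + 1, cand & adj[v])

    grow(0, set(verts))
    return best

def essential_set_reduce(adj):
    """Find all essential vertices, then remove the non-neighbors of each."""
    verts = set(adj)
    k = omega(adj, verts)
    # v is essential iff the maximum clique size drops when v is removed
    essential = {v for v in verts if omega(adj, verts - {v}) < k}
    keep = set(verts)
    for v in essential:
        keep &= adj[v] | {v}                    # keep v and its neighbors only
    return essential, {v: adj[v] & keep for v in keep}
```

For two triangles sharing the edge 1-2, with a pendant vertex attached to vertex 0, the essential set is {1, 2} and the pendant vertex is removed.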

Implementation 

We implemented all algorithms in either C or C++. The 
code was compiled using the GCC 4.4.3 compiler on the 
Ubuntu Linux version 10.04.2 operating system as well 
as the GCC 3.3.5 compiler under Debian Linux version 
3.1. All timings were conducted in the latter Debian 
environment on dedicated nodes of a cluster to ensure 
no effect on timings from concurrent processes. Each 
node had a dual-core Intel Xeon processor running at 
3.20 GHz and 4 GB of main memory. 

Testing 

In the conference version of this paper, we used three dif- 
ferent datasets at 25 thresholds each to derive a total of 
75 graphs on which to test our algorithmic improve- 
ments. While these graphs certainly sufficed as an initial 
proof of concept, two concerns could be raised regarding 
them. First, one might argue that three datasets are not a 
sufficiently large sample size to provide a true sense of 
the overall nature of transcriptomic data or an algorith- 
mic improvement's general effectiveness on such data, 
the large number of thresholds notwithstanding. And 
second, since the three datasets are proprietary and not 
publicly available, the results were not as readily reprodu- 
cible as they might otherwise have been. Obtaining de- 
identified versions, while feasible, was an unnecessary 
obstacle to reproducibility. 

Figure 3 The Essential Set (ES) Algorithm. The ES algorithm finds all essential vertices in a graph and removes their non-neighbors. 

We address such concerns here by creating a new suite 
of transcriptomic graphs on which to test our algorithmic 
improvements. The suite consists of graphs derived from 
25 datasets obtained from the Gene Expression Omnibus 
(GEO) [25], a publicly accessible repository. For each 
dataset, graphs were created at four different thresholds, 
for a total of 100 graphs. The datasets were selected to 
provide a reasonably diverse sampling of experimental 
type, species, and mRNA microarray chip type. They 
cover 8 different species and a number of different 
experimental conditions such as time series, strain, 
dose, and patient. Since our graphs are derived from 
thresholding correlation values, we excluded from con- 
sideration any dataset with fewer than 12 conditions. 
Thresholding correlations calculated using so few con- 
ditions can produce unacceptably large rates of false 
positives and false negatives. The number of conditions 
ranges from a low of 12 to a high of 153. Nine of the 
datasets had not been log-transformed, in which case 
we performed log-transformation. Four of the datasets 
contained missing values; in these cases we used corre- 
lation p-values rather than correlations for the thresh- 
old. See Table 1 for a listing of the GEO datasets used 
for testing. 

From the expression data, we first constructed 
weighted graphs in which vertices represented probes 
and edge weights were Pearson correlation coefficients 
computed across experimental conditions. We then con- 
verted the weighted graphs into unweighted graphs by 
retaining only those edges whose weights were at or 
above some chosen threshold, t. For each dataset, we 
chose four values for t. All size/density values were 
within the spectrum typically seen in our work with bio- 
logical datasets. The smallest graph had 3,828 vertices 
and 310,380 edges; the largest had 44,563 vertices and 
2,052,228 edges. 
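
The graph construction can be sketched as follows. This is a small pure-Python illustration, not the pipeline used for the testbed: it assumes the expression data is a dictionary `expr` mapping each probe name to a list of values across conditions, that no profile is constant, and that there are no missing values. The names `pearson` and `threshold_graph` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)        # assumes neither profile is constant

def threshold_graph(expr, t):
    """Unweighted graph keeping edges whose correlation is at or above t."""
    probes = list(expr)
    adj = {p: set() for p in probes}
    for i, p in enumerate(probes):
        for q in probes[i + 1:]:
            if pearson(expr[p], expr[q]) >= t:
                adj[p].add(q)
                adj[q].add(p)
    return adj
```

The all-pairs loop is quadratic in the number of probes, so a sketch like this is only practical for small inputs.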

The number of maximum cliques for the graphs in our 
testbed ranged from 8 to 74,486. As seen with our previous 
testbed, there was no discernible pattern based on 
graph size or density. One might ask why there is such 
wide, unpredictable variability. It turns out that the num- 
ber of maximum cliques can be extremely sensitive to 
small changes in the graph. Even the modification of a 
single edge can have a huge effect. Consider, for example, 
a graph with a unique maximum clique of size k, along 
with a host of disjoint cliques of size k - 1. The removal 



of just one edge from what was the largest clique may 
now result in many maximum cliques of size k - 1. Edge 
addition can of course have similar effects. See Figure 4 
for an illustrative example. 
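
This sensitivity is easy to reproduce on a toy instance. The Python sketch below builds a graph in the spirit of the figure, with a clique on vertices 0 to 4 (k = 5) and three extra vertices each joined to the same k - 2 = 3 clique vertices, then removes one clique edge. The particular vertex labels, the choice of removed edge, and the brute-force helper `maximum_cliques` are assumptions made for illustration. The helper tests all vertex subsets of decreasing size, which is feasible only for very small graphs.

```python
from itertools import combinations

def build(edges):
    """Adjacency dictionary (vertex -> set of neighbors) from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def maximum_cliques(adj):
    """All maximum cliques, by testing vertex subsets from largest to smallest."""
    verts = sorted(adj)
    for size in range(len(verts), 0, -1):
        found = [c for c in combinations(verts, size)
                 if all(b in adj[a] for a, b in combinations(c, 2))]
        if found:
            return found
    return []

core = list(combinations(range(5), 2))                    # clique on vertices 0..4
extras = [(x, c) for x in (5, 6, 7) for c in (2, 3, 4)]   # three added vertices
before = build(core + extras)
after = build([e for e in core + extras if e != (0, 1)])  # remove a single edge

print(len(maximum_cliques(before)), len(maximum_cliques(after)))
```

Before the removal there is a single maximum clique of size 5. Afterwards there are five maximum cliques of size 4: two inside the old clique and one for each added vertex.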

For each algorithm on each graph, we conducted tim- 
ings on a dedicated node of a cluster to avoid interfer- 
ence from other processes. If the algorithm did not 
complete within 24 hours, it was halted and the graph 
was deemed to have not been solved. We chose thresh- 
olds to spread the runtimes of the graphs out over the 
five algorithms we were testing. The largest (smallest in 
the case of correlation p-value) threshold was selected so 
that a majority of the algorithms, if not all, solved the 
graph. The smallest (largest in the case of correlation p- 
value) threshold was selected so that at least one of the 
algorithms, but not all, solved the graph. 

On each graph we timed the performance of Basic Backtracking, 
Intelligent Backtracking, and Parameterized MC. 
We then reduced the graphs using ES and retested with 
Intelligent Backtracking and Parameterized MC, in which 
case the runtimes include both the reduction and the enu- 
meration step. As expected, Basic Backtracking was found 
to be non-competitive. Both Intelligent Backtracking and 
Parameterized MC showed a distinct, often dramatic, 
improvement over Basic Backtracking. Figure 5 shows the 
runtimes of each of the five methods on all 100 test 
graphs. On some of the easier graphs, ones taking less 
than three minutes to solve, the overhead of ES actually 
caused a minor increase in the overall runtime. But on the 
more difficult instances its true benefit became apparent, 
reducing runtime by an order of magnitude or more. And 
in all cases where two or fewer algorithms solved the 
graph, the algorithm was either ES with Intelligent Back- 
tracking, ES with Parameterized MC, or both. 

Conclusions 

ES serves as a practical example of an innovative algorithm 
tailored to handle a difficult combinatorial problem 
by exploiting knowledge of the input space. It succeeds 
by exploiting properties of the graphs of interest, in this 
case the overlapping nature of maximum cliques. More 
broadly, these experiments underscore the importance of 
considering graph types when testing algorithms. 

Table 1 GEO Datasets Used for Testing 

DataSet  Title                                                                                 Organism
GDS3505  Seedling roots response to auxin and ethylene availability                            Arabidopsis thaliana
GDS3521  Retina response to hypoxia and subsequent reoxygenation: time course                  Mus musculus
GDS3538  Age and diet effect on canine skeletal muscles                                        Canis lupus familiaris
GDS3561  Occupational benzene exposure peripheral blood mononuclear cells (HumanRef-8)         Homo sapiens
GDS3579  Fer-1 null mutants                                                                    Caenorhabditis elegans
GDS3592  Ovarian normal surface epithelia and ovarian cancer epithelia cells                   Homo sapiens
GDS3595  Macrophage response to H1N1 and H5N1 influenza viral infections                       Homo sapiens
GDS3603  Renal cancer response to rapamycin analog CCI-779 treatment                           Homo sapiens
GDS3605  Spared nerve injury model of peripheral neuropathic pain: dorsal horn of spinal cord  Rattus norvegicus
GDS3610  Nasopharyngeal carcinoma                                                              Homo sapiens
GDS3622  Nrf2-deficient lung response to cigarette smoke: dose response and time course        Mus musculus
GDS3623  Heart regeneration in zebrafish                                                       Danio rerio
GDS3639  Male and female fruit flies of various wild-type laboratory strains                   Drosophila melanogaster
GDS3540  Copper effect on liver cell line dose response and time course                        Homo sapiens
GDS3644  Cerebral palsy: wrist muscles                                                         Homo sapiens
GDS3646  Celiac disease: primary leukocytes                                                    Homo sapiens
GDS3648  Cardiomyocyte response to various types of fatty acids in vitro                       Rattus norvegicus
GDS3661  Hypertensive heart failure model                                                      Rattus norvegicus
GDS3672  Hypertension model: aorta                                                             Mus musculus
GDS3690  Atherosclerotic Coronary Artery Disease: circulating mononuclear cell types           Homo sapiens
GDS3715  Insulin effect on skeletal muscle                                                     Homo sapiens
GDS3716  Breast cancer: histologically normal breast epithelium                                Homo sapiens
GDS3703  Addictive drugs effect on brain striatum: time course                                 Mus musculus
GDS3707  Acute ethanol exposure: time course                                                   Drosophila melanogaster
GDS3692  Lean B5C-D7Mit353 strain: various tissues                                             Mus musculus

The 25 datasets obtained from the Gene Expression Omnibus (GEO) [25]. All datasets were retrieved between 4-04-2011 and 4-23-2011. Each dataset was 
log-transformed if it had not been already. For each dataset, four different correlation thresholds were used to build unweighted graphs. 

It may be useful to examine graph size after applying 
MCC and ES, and compare to both the size of the origi- 
nal graph and the amount of reduction achieved by 
color preprocessing alone. Figures 6 and 7 depict 



original and reduced graph sizes for five graphs we ori- 
ginally tested. 

While MCC seems as if it should produce better 
results, in practice we find it not to be the case for two 
reasons. First, the vertices in an MCC may collectively 
be connected to a large portion of the rest of the graph, 
and so very little reduction in graph size takes place. 




(a) (b) 

Figure 4 Maximum Clique Sensitivity. The number of maximum cliques in a graph can be highly subject to perturbations due, for example, 
to noise. For example, a graph may contain a single maximum clique C representing a putative network of size k, along with any number of 
vertices connected to k - 2 vertices in C. In (a), there is a single maximum clique of size k = 5, with "many" other vertices (only three are shown) 
connected to k - 2 = 3 of its nodes. In (b), noise results in the removal of a single edge, creating many maximum cliques now of size k - 1 = 4. 

Figure 5 Timings. Timings on various approaches to MCE on the testbed of 100 biological graphs. Timings include all preprocessing, as well as 
the time to find the maximum clique size, where applicable. Runs were halted after 24 hours and deemed to have not been solved, as 
represented by those shown to take 86400 seconds. The graph instances are sorted first in order of runtimes for Basic Backtracking, then in 
order of runtimes for Intelligent Backtracking. This is a reasonable way to visualize the timings, though not perfect, since graphs that are difficult 
for one method may not be as difficult for another, hence the subsequent timings are not monotonic. 



And second, any reduction in graph size may be redun- 
dant with FPT-style preprocessing rules already in place. 
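
The sensitivity described in the Figure 4 caption can be checked on a toy graph. The following Python sketch is illustrative only: the function names `maximum_cliques` and `build_figure4_graph` are our own, the enumeration is a plain Bron-Kerbosch search with pivoting rather than any of the tuned implementations timed in this paper, and we assume that every extra vertex attaches to the same k - 2 clique vertices and that the removed edge joins the two clique vertices to which no extra vertex is attached.

```python
from itertools import combinations


def maximum_cliques(adj):
    """Return every maximum clique of a graph given as {vertex: set_of_neighbors}."""
    maximal = []

    def expand(r, p, x):
        # Bron-Kerbosch with pivoting: r is the growing clique, p holds
        # candidate vertices, x holds vertices that were already explored.
        if not p and not x:
            maximal.append(frozenset(r))
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    best = max(len(c) for c in maximal)
    return [c for c in maximal if len(c) == best]


def build_figure4_graph(k=5, extras=3):
    """A k-clique on vertices 0..k-1 plus `extras` vertices, each joined to
    the k - 2 clique vertices 2..k-1 (an assumed layout for illustration)."""
    adj = {v: set() for v in range(k + extras)}
    for u, v in combinations(range(k), 2):
        adj[u].add(v)
        adj[v].add(u)
    for extra in range(k, k + extras):
        for c in range(2, k):
            adj[extra].add(c)
            adj[c].add(extra)
    return adj


graph = build_figure4_graph(k=5, extras=3)
before = maximum_cliques(graph)

# "Noise": delete the single edge between clique vertices 0 and 1.
graph[0].discard(1)
graph[1].discard(0)
after = maximum_cliques(graph)

print(len(before), len(before[0]))
print(len(after), len(after[0]))
```

With k = 5 and three extra vertices, the graph has one maximum clique of size 5 before the deletion and five maximum cliques of size 4 afterwards: two from the split clique and one per extra vertex. Increasing `extras` increases the second count accordingly.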

Contrast to random graphs 

It would probably have been fruitless to design and test
our algorithms around random graphs. (Yet practitioners
do just that with some regularity.) In fact, it has
long been observed that the topology of graphs derived
from real relationships differs drastically from the
Erdős-Rényi random graph model introduced in [26].



Attempts have been made to characterize the properties of real data
graphs, such as the notion of scale-free graphs, in which the degrees
of the vertices follow a power-law distribution [27]. While work to
develop the scale-free model into a formal mathematical framework
continues [28], there remains no generally accepted formal definition.
More importantly, the scale-free model is an inadequate description of
real data graphs. We have observed that constructing a graph whose
vertices follow a power-law (scale-free) degree distribution, but whose
edges are otherwise placed randomly using the vertex degrees as
relative probabilities for edge placement, still results in graphs with
numerous small disjoint maximum cliques. For instance, constructing
graphs with the same degree distribution as each of the 75 biological
graphs in our original testbed resulted in maximum clique sizes no
greater than 5 for even the highest density graphs. Compare this to
maximum clique sizes that ranged into hundreds of vertices in the
corresponding biological graphs. Other metrics, such as clustering
coefficient and diameter, have been introduced in an attempt to define
important properties. Collectively, however, such metrics remain
inadequate to model fully the types of graphs derived from actual
biological data. The notions of maximum clique cover and essential
vertices stem from the observation that transcriptomic data graphs
tend to have one very large, highly connected region, and most (very
often all) of the maximum cliques lie in that region. Furthermore,
there tends to be a great amount of overlap between maximum cliques,
perhaps as a natural result of gene pleiotropism. Such overlap is key
to the runtime improvement achieved by the ES algorithm.

Figure 6 Reduction in Graph Size. Reduction in graph size thanks to preprocessing on five representative graphs chosen from our testbed.
Each of the four preprocessing methods greatly reduces the graph size.

Figure 7 Reduction in Graph Size. A zoomed view of Figure 6, showing the effectiveness of each preprocessing method at reducing graph
size. ES preprocessing results in the smallest reduced graph, often leaving only a small fraction of the vertices left by other methods.
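
The random-graph comparison above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the generator used for the experiments reported here: the text says only that edges were placed randomly with vertex degrees as relative probabilities, so we assume a Chung-Lu-style sampler that draws both endpoints in proportion to the target degrees and rejects self-loops and duplicate edges. The names `degree_weighted_random_graph` and `max_clique_size`, and the example degree sequence, are hypothetical.

```python
import random


def degree_weighted_random_graph(degrees, seed=0):
    """Sample a simple graph with sum(degrees) // 2 edges, drawing each
    endpoint with probability proportional to its target degree."""
    rng = random.Random(seed)
    vertices = list(range(len(degrees)))
    target_edges = sum(degrees) // 2
    positive = sum(1 for d in degrees if d > 0)
    if target_edges > positive * (positive - 1) // 2:
        raise ValueError("more edges requested than a simple graph allows")
    edges = set()
    while len(edges) < target_edges:
        u, v = rng.choices(vertices, weights=degrees, k=2)
        if u != v:  # reject self-loops; the set discards duplicate edges
            edges.add((min(u, v), max(u, v)))
    return edges


def max_clique_size(n, edges):
    """Exact maximum clique size by a simple branch-and-bound search."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best = 0

    def expand(size, candidates):
        nonlocal best
        best = max(best, size)
        if size + len(candidates) <= best:
            return  # this branch cannot beat the best clique found so far
        for v in sorted(candidates, key=lambda u: len(adj[u]), reverse=True):
            expand(size + 1, candidates & adj[v])
            candidates = candidates - {v}

    expand(0, set(adj))
    return best


# A rough power-law-like degree sequence (an assumption for illustration).
degrees = [max(1, round(40 / (i + 1))) for i in range(60)]
edges = degree_weighted_random_graph(degrees, seed=1)
print(len(edges), max_clique_size(len(degrees), edges))
```

To mirror the experiment described above, `degrees` would be replaced by the observed degree sequence of a biological graph, and the resulting maximum clique size compared with that of the original graph.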

Future research directions 

Our efforts with MCE suggest a number of areas with
potential for further investigation. A formal definition of
the class of graphs for which ES achieves runtime
improvements may lead to new theoretical complexity
results, perhaps based upon parameterizing by the
amount of maximum clique overlap. Furthermore, such a
formal definition may form the basis of a new model for
real data graphs. We have noted that the number of
disjoint maximum cliques that can be extracted provides an
upper bound on the size of an MCC. If we parameterize
by the maximum clique size and the number of maximum
cliques, does an FPT algorithm exist? In addition,
formal mathematical results may be achieved on the
sensitivity of the number of maximum cliques to small
changes in the graph.

Note that any MCC forms a hitting set over the set of
maximum cliques, though not necessarily a minimum
one. Also, a set D of disjoint maximum cliques, to
which no additional disjoint maximum clique can be
added, forms a subset cover over the set of all maximum
cliques. That is, any maximum clique C ∉ D contains at
least one v ∈ D. See Figure 8. To the best of our
knowledge, this problem has not previously been studied. All
we have found in the literature is one citation that
erroneously reported it to be one of Karp's original
NP-complete problems [29].
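
The observation about maximal families of disjoint cliques can be illustrated as follows. This is a sketch under assumptions: the greedy selection is merely one simple way to obtain a family D to which no further disjoint set can be added, the function names are our own, and the five three-element sets stand in for maximum cliques of equal size.

```python
def maximal_disjoint_family(cliques):
    """Greedily pick pairwise disjoint sets until none can be added."""
    chosen, used = [], set()
    for clique in cliques:
        if used.isdisjoint(clique):
            chosen.append(clique)
            used |= clique
    return chosen


def is_subset_cover(chosen, cliques):
    """True if every set outside `chosen` shares a vertex with some chosen set."""
    covered = set().union(*chosen)
    return all(c in chosen or bool(covered & c) for c in cliques)


# Stand-ins for maximum cliques, all of the same size (illustrative data).
cliques = [frozenset(s) for s in
           ({1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8, 9}, {2, 4, 6})]

family = maximal_disjoint_family(cliques)
print(family)
print(is_subset_cover(family, cliques))
```

On this input the greedy pass keeps {1, 2, 3} and {5, 6, 7}, and each of the three remaining sets intersects at least one of them, so the family is a subset cover. The greedy family is maximal but not necessarily of minimum size.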

For the subset cover problem, we have noted that it is
NP-hard by a simple reduction from hitting set. But in
the context of MCE we have subsets all of the same size.
It may be that this alters the complexity of the problem,
or that one can achieve tighter complexity bounds when
parameterizing by the subset size. Alternatively, consider
the problem of finding the minimum subset cover given
a known minimum hitting set. The complexity of this
tangential problem is not at all clear, although we
conjecture it to be NP-complete in and of itself. Lastly, as a
practical matter, exploring whether an algorithm can
address the memory issues of the subset enumeration
algorithm presented in [13] and improved in [14] may
also prove fruitful. As we have found here, it may well
depend at least in part on the data.

Figure 8 The Subset Cover Problem. The decision version of the subset cover problem asks if there are k or fewer subsets that cover all other
subsets. A satisfying solution for k = 4 is the highlighted subsets.
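
For small instances, the decision version of the subset cover problem stated in the Figure 8 caption can be written as an exhaustive check. The sketch below is an illustration only: it assumes that a chosen collection "covers" another subset when the two share at least one element, which is our reading of the definition given above, and it runs in exponential time, consistent with the hardness noted for the problem. The function names are hypothetical.

```python
from itertools import combinations


def covers(chosen, subsets):
    """True if every subset is chosen or shares an element with a chosen one."""
    touched = set().union(*chosen)
    return all(s in chosen or bool(touched & s) for s in subsets)


def has_subset_cover(subsets, k):
    """Brute-force decision: do k or fewer of the subsets cover all the others?"""
    subsets = [frozenset(s) for s in subsets]
    for size in range(0, min(k, len(subsets)) + 1):
        for chosen in combinations(subsets, size):
            if covers(chosen, subsets):
                return True
    return False


# Illustrative instance with subsets all of the same size, as in MCE.
subsets = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8, 9}, {2, 4, 6}]
print(has_subset_cover(subsets, 1))
print(has_subset_cover(subsets, 2))
```

On this instance no single subset intersects all four of the others, while the pair {1, 2, 3} and {5, 6, 7} does, so the answer is negative for k = 1 and positive for k = 2.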